Magiclink support for custom autolinked references #2550

jklukas · 2024-12-17T02:20:34Z

Initial implementation of customrefs as proposed in #2549 to keep discussion going. Happy to refactor, add further tests, and to discuss details based on this content.

TODO:

Documentation

jklukas · 2024-12-17T12:54:11Z

pymdownx/magiclink.py

@@ -1036,6 +1045,33 @@ def handleMatch(self, m, data):
        return el, m.start(0), m.end(0)


+class MagiclinkCustomRefPattern(InlineProcessor):
+    """Return a link Element given a custom prefix."""


TODO: better docstring here

jklukas · 2024-12-17T12:56:17Z

pymdownx/magiclink.py

+    def __init__(self, pattern, md, shortname, target_url):
+        """Initialize."""
+
+        self.shortname = shortname


TODO: better term than "shortname". This is the lower-cased version of the prefix that we inject into a class name and inject into the name we use when registering the inlineProcessor.

jklukas · 2024-12-17T13:14:08Z

tests/test_extensions/test_magiclink.py

+        )
+
+    def test_underscores(self):
+        """Test underscore counts as a word character."""


We originally discussed supporting "alphanumeric" identifiers, but re's \w pattern for "word characters" includes underscore, and that definitely feels like a useful addition. But then in the next test here we validate that hyphens break the pattern, but I could see hyphens being useful as well.

These two test cases are validating "go/ link"-style links which would benefit from supporting / and # in identifiers as well. I've worked in two companies that provide a go/ link service, and it's fairly normal to see a link that includes a subpath and an anchor like go/myproject/starting#authentication.

So, we could consider expanding the set to "alphanumeric characters, plus _@/#-". That excludes periods and commas, which would often be word boundaries.

Or, we could make this feature significantly more generic and powerful by letting users specify regex patterns themselves. And we'd rely on documentation to give users solid examples for the simpler cases. I'm interested in your perspective on the user base here and whether they'd generally prefer to have the power+responsibility of a more generic interface. GitHub's autolink reference feature needs to run whenever rendering the UI, but magiclinks is generally running as a static compiling step, so I expect we have more budget for complexity here.

To be more specific, when I said alphanumeric, I was thinking [a-z0-9]+ (assuming case insensitive matching). This is a common restriction for these types of identifiers. As a matter of fact, Jira imposes these same restrictions (Jira being the example used when this feature was requested). Additionally, Jira for instance does allow _ underscores ([a-z0-9_]+). Originally, my gut was telling me to restrict this to only keys that start with a letter, another common restriction for IDs of this type, and yes Jira does this as well ([a-z][a-z0-9_]+).

Now that I've had some more time to think, I do think in general ([a-z][a-z0-9_]+) is a fine restriction and one that we should adopt for now if we want to continue forward with this approach. This can always be relaxed assuming we find practical cases that require such.

I think each proposed ID should be validated with some regex such as this.

These two test cases are validating "go/ link"-style links which would benefit from supporting / and # in identifiers as well. I've worked in two companies that provide a go/ link service, and it's fairly normal to see a link that includes a subpath and an anchor like go/myproject/starting#authentication.

So, we could consider expanding the set to "alphanumeric characters, plus _@/#-". That excludes periods and commas, which would often be word boundaries.

I want to be careful here as we are now running into scope increase. The original feature discussed was simply a method of providing a short name to translate into a link, now we are talking about mirroring various different issue tracking systems, and I'm not sure that is something I want or am willing to do.

I'm not saying there isn't value in such suggestions, but at some point, I wonder if a user has more complicated auto-linking requirements if it would be better that they just write a specialized plugin at that point. Regardless, for now, I'd like to keep the scope reasonable or we need to go back and decide what it is we really want to do so that we can approach it the right way. Depending on what the end goal is, it may greatly affect how I'd like to go about the implementation.

If there is a bigger vision here we need to take a step back and define what our goals are, and maybe this PR is not the answer. Maybe we need a different approach.

If we want a more robust system for various repos that is similar to our current support, then we may need to consider refactoring and generalizing the existing system to make GitHub, Bitbucket, and GitLab handling even more generic so that Jira or any other system can be plugged in (within reason).

Different systems have different username, project, and issue requirements. Some may be more difficult to identify; as patterns become more complex, so do false-positive matching and conflicts with other matching. So there are going to be some restrictions of what we can reasonably do in a generic sense. Maybe we can't always support every kind of link like we do for GitHub in Jira (or some other service).

For instance, some systems may use the user email as the identifier in the URL, but may also display the user name in other aspects. Without the server making such translations and us tapping into their API, that is impossible for us. Also, we already translate emails into email links.

When I was focused on the 3 big free services, the intent was that those would be all I was supporting to aid the majority of the open-source user base. They were generally kind of similar and monkied each other in various ways. Corporate installs of these often are allowed more freedom to do things a bit differently. I know our corporate username requirements for Bitbucket is different than that of the open-source version of the site. When I developed this extension, I wasn't targeting the corporate world specifically.

I'm interested in your perspective on the user base here and whether they'd generally prefer to have the power+responsibility of a more generic interface.

My thoughts are pulled between two things. I do like to give the user power but I also like to keep my maintenance cost (in regards to time) at a reasonable level as well. So my decisions are often influenced by these two things.

In short, if there is a bigger vision beyond a simple short name to URL mapping, let's iron out what it is we want and are willing to do so that we can decide on a sustainable approach. If what we are doing now is too limited, we need to take a step back and reconsider now before we implement something that we want to rewrite later.

This is great feedback and I appreciate the depth.

If there is a bigger vision here we need to take a step back and define what our goals are, and maybe this PR is not the answer. Maybe we need a different approach.

I will think about this. I agree with you that we either proceed with a simple feature that allows only [a-z0-9_]+ or we scrap this and define a broader project.

I do think it's worth moving forward with a fairly restrictive model of user-defined autolink references. I've updated the feature name here to "simplerefs" to better capture the spirit of this model.

Now that I've had some more time to think, I do think in general ([a-z][a-z0-9_]+) is a fine restriction and one that we should adopt for now if we want to continue forward with this approach. This can always be relaxed assuming we find practical cases that require such.

I'm not entirely clear whether you're referring to the ref_prefix or the captured identifier part of this, so let's unpack those separately.

For ref_prefix values, we would need to allow at least - to allow the Jira case (JIRA-123) since the prefix part ends in a dash. Github's autolink references feature doesn't document any restrictions on the configured prefixes (other than warning the user not to define overlapping prefixes); it supports go/ as a prefix, for example. I think it would be reasonable to require these start with a letter. So I think we should allow at least [a-z][a-z0-9_-/]+ for prefixes.

For captured identifiers, we need to allow pure numeric values to support the Jira case, so can't require it starts with a letter. So this could be straight alphanumeric [a-z0-9] but I think we should support underscores [a-z0-9_] as well.

I do think it's worth moving forward with a fairly restrictive model of user-defined autolink references. I've updated the feature name here to "simplerefs" to better capture the spirit of this model.

Then let's take a step back and let's try and create a proposal spec of what it is we want. My intention was never to support every repo in existence, and the plugin is fairly complicated as is. So when accepting new features, I have to consider that I will be maintaining the solution long after the PR is developed and accepted.

Or maybe I am misunderstanding what you mean by this statement?

For captured identifiers, we need to allow pure numeric values to support the Jira case, so can't require it starts with a letter. So this could be straight alphanumeric [a-z0-9] but I think we should support underscores [a-z0-9_] as well.

Numbers are captured separately from prefixes. Numbers are numbers and I'm not talking about numbers. Project3- is a prefix so Project3-234 would be issue 234 of Project3. We can include - as explicit captures, but I don't think these prefixes should start with -, _, or numbers. I may be persuaded to allow numbers at the start, but not - and _.

I imagine something like this:

import re RE_VALIDATE = re.compile(r'^[a-z][-a-z0-9_]+$') patterns = [ 'test1', 'test-2', '-test3', 'test4-' ] validated = '|'.join(re.escape(p) for p in patterns if RE_VALIDATE.match(p)) pattern = re.compile(fr"\b({validated})([0-9]+)\b") BUFFER = """ This is a test to find things like test-21234, test11234, and test4-1234, but not -test31234. """ for found in pattern.findall(BUFFER): print(found)

('test-2', '1234') ('test1', '1234') ('test4-', '1234')

There are a number of questions that come to mind looking at this. Do we want normalize the final results to all uppercase? I know we talked about matching them case insensitive, but should we normalize case on display in the final document?

Should we have an implied - between numbers and prefixes so the user doesn't have to explicitly define one. I don't know that short names that omit such a separator is preferred unless we just want to mimic other systems even though it is visually suboptimal.

Do we want normalize the final results to all uppercase?

No, I think we should preserve the casing of the text as written in the document. Assuming uppercase feels pretty Jira-specific.

Should we have an implied - between numbers and prefixes

No, I think ending punctuation like that should be part of the configured prefix, as it is with GitHub's autolink references interface.

Numbers are numbers and I'm not talking about numbers

Lol. I think we should call the captured "number" an "identifier" (so <id> rather than <num>) to avoid this confusion. The "identifier" is the variable part that is captured after the prefix.

So we end up capturing text in the form {prefix}{identifier}. We output exactly the text that was captured, but wrap it in <a>.

It might make sense for me to go ahead and add proposed documentation in this PR and have that serve as the proposal spec.

It might make sense for me to go ahead and add proposed documentation in this PR and have that serve as the proposal spec.

I'm okay with that as I want to understand what the actual proposal is opposed to seeing evolve through iterations of commits.

jklukas · 2024-12-19T14:11:38Z